Dynamic task dispatching strategy for stream processing based on flow network
LI Ziyang, YU Jiong, BIAN Chen, LU Liang, PU Yonglin
Journal of Computer Applications    2018, 38 (9): 2560-2567.   DOI: 10.11772/j.issn.1001-9081.2017122910
Concerning the problem that a sharp increase in the data input rate raises computing latency and harms the real-time performance of big data stream processing platforms, a dynamic dispatching strategy based on flow networks was proposed and applied to the stream processing platform Apache Flink. Firstly, the Directed Acyclic Graph (DAG) of a job was transformed into a flow network by defining the capacity and flow of every edge, and a capacity detection algorithm was used to determine the capacity of each edge. Secondly, a maximum flow algorithm was used to obtain the improved network and the optimized path, so as to raise cluster throughput while the data input rate increases; the feasibility of the algorithm was established by analyzing its time and space complexity. Finally, the influence of an important parameter on the algorithm's execution was discussed, and recommended parameter values for different types of jobs were obtained by experiments. The experimental results show that, compared with the original dispatching strategy of Apache Flink, the proposed strategy improves throughput by more than 16.12% during phases of increasing data input rate across different benchmarks, so the dynamic dispatching strategy effectively raises cluster throughput under the constraint of task latency.
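The capacity-detection and path-improvement steps above are specific to Flink's runtime, but the core maximum-flow computation over the transformed DAG is standard. The sketch below shows one such computation (Edmonds-Karp, a textbook algorithm); the operator graph and edge capacities are illustrative, not taken from the paper.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a capacity dict {u: {v: cap}}."""
    # Build the residual network, adding zero-capacity reverse edges.
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u, vs in capacity.items():
        for v in vs:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual network.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow          # no augmenting path left: flow is maximal
        # Find the bottleneck capacity along the path.
        v, bottleneck = sink, float("inf")
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Augment along the path, updating forward and reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
```

On a small DAG with source `s` and sink `t`, the returned value is the maximum tuple rate the network of operators can sustain under the given edge capacities.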
Task scheduling algorithm based on weight in Storm
LU Liang, YU Jiong, BIAN Chen, YING Changtian, SHI Kangli, PU Yonglin
Journal of Computer Applications    2018, 38 (3): 699-706.   DOI: 10.11772/j.issn.1001-9081.2017082125
Apache Storm, a typical big data stream computing platform, uses round-robin scheduling by default, which ignores the fact that computational and communication costs differ widely among the tasks and data streams of a topology, leaving room for optimization in load balance and communication cost. To solve this problem, a Task Scheduling Algorithm based on Weight in Storm (TSAW-Storm) was proposed. In the algorithm, the CPU occupation of a task in a specific topology was taken as its weight, and similarly the tuple rate between a pair of tasks was taken as the weight of the data stream connecting them. Tasks were then assigned one by one to the most suitable work node by maximizing the gained stream weight, converting as many inter-node data streams as possible into intra-node ones while preserving load balance, so as to reduce network overhead. Experimental results on a WordCount benchmark with 8 work nodes show that, compared with Storm's default scheduler, TSAW-Storm reduces latency and inter-node tuple rate by about 30.0% and 32.9% respectively, and the standard deviation of worker-node CPU load is only 25.8% of that of the default scheduler. In a further experiment against the online scheduler, TSAW-Storm reduces latency, inter-node tuple rate and the standard deviation of CPU load by about 7.76%, 11.8% and 5.93% respectively, with only a small scheduling overhead. Therefore, the proposed algorithm can effectively reduce communication cost and improve load balance, contributing to the efficient operation of Apache Storm.
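The greedy placement idea described above can be sketched independently of Storm: assign each task to the node where co-locating it with already-placed neighbors converts the most stream weight to intra-node traffic, subject to a soft CPU budget. The budget factor, tie-breaking rule, and data shapes below are illustrative assumptions, not the paper's exact formulation.

```python
def schedule(task_cpu, stream_weight, nodes):
    """Greedy weighted placement: put each task on the node that turns
    the most stream weight intra-node, within a per-node CPU budget."""
    # Soft load-balance cap: mean load times a tolerance factor (assumed 1.2).
    budget = sum(task_cpu.values()) / len(nodes) * 1.2
    load = {n: 0.0 for n in nodes}
    placement = {}
    # Visit tasks in order of decreasing CPU weight.
    for task in sorted(task_cpu, key=task_cpu.get, reverse=True):
        def gain(node):
            # Stream weight made intra-node by co-locating `task` with
            # tasks already placed on `node`.
            return sum(w for (a, b), w in stream_weight.items()
                       if (a == task and placement.get(b) == node)
                       or (b == task and placement.get(a) == node))
        # Prefer nodes with remaining budget; fall back to all nodes.
        candidates = [n for n in nodes
                      if load[n] + task_cpu[task] <= budget] or nodes
        best = max(candidates, key=lambda n: (gain(n), -load[n]))
        placement[task] = best
        load[best] += task_cpu[task]
    return placement
```

With two heavily-communicating task pairs and two nodes, the sketch co-locates each pair while keeping per-node load within the budget.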
Partitioning and mapping algorithm for in-memory computing framework based on iterative filling
BIAN Chen, YU Jiong, XIU Weirong, YING Changtian, QIAN Yurong
Journal of Computer Applications    2017, 37 (3): 647-653.   DOI: 10.11772/j.issn.1001-9081.2017.03.647
Focusing on the issue that Spark's single Hash/Range partitioning strategy often produces unbalanced data loads at the Reduce phase and sharply increases job duration, an Iterative Filling data Partitioning and Mapping algorithm (IFPM) comprising several novel components was proposed. First, based on an analysis of Spark's job execution scheme, a job efficiency model and a partition mapping model were established, and the definitions of job execution time and allocation skew degree were given. Then, the Extendible Partitioning Algorithm (EPA) and the Iterative Mapping Algorithm (IMA) were proposed: at the Map phase, a one-to-many partition function reserved part of the data into an extended region; data in the extended region was then mapped by additional iterative allocations once the approximate data distribution had been observed, and at the Reduce phase an adaptive mapping function, aware of the amount of data already assigned, revised the unbalanced loads of the original-region allocation. Experimental results demonstrate that, for any data distribution, IFPM improves the rationality of data load allocation from the Map phase to the Reduce phase and optimizes the job efficiency of the in-memory computing framework.
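A minimal sketch of the reserve-then-fill idea, under loose assumptions (the reserve fraction, the every-Nth-record reservation rule, and the round count are all illustrative; the paper's EPA/IMA operate inside Spark's shuffle, not on plain lists):

```python
def iterative_fill_partition(keys, n_partitions, reserve=0.2, rounds=4):
    """Hash most records to their original partition, hold a fraction
    back in an 'extended region', then feed the held-back records to
    the lightest partitions to even out reduce-side load."""
    original = [[] for _ in range(n_partitions)]
    extended = []
    step = int(round(1 / reserve))
    for i, k in enumerate(keys):
        # One-to-many partition function: every `step`-th record is
        # reserved instead of being hashed immediately.
        if i % step == 0:
            extended.append(k)
        else:
            original[hash(k) % n_partitions].append(k)
    # Iterative mapping: assign the extended region to the lightest
    # partitions in several rounds, once skew has been observed.
    per_round = max(1, len(extended) // rounds)
    while extended:
        batch, extended = extended[:per_round], extended[per_round:]
        for k in batch:
            lightest = min(range(n_partitions),
                           key=lambda p: len(original[p]))
            original[lightest].append(k)
    return original
```

Even with a heavily skewed key distribution, the reserved records end up in the lightest partitions, so no partition is left empty.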
Dynamic data stream load balancing strategy based on load awareness
LI Ziyang, YU Jiong, BIAN Chen, WANG Yuefei, LU Liang
Journal of Computer Applications    2017, 37 (10): 2760-2766.   DOI: 10.11772/j.issn.1001-9081.2017.10.2760
Concerning the unbalanced load and incomplete node evaluation in big data stream processing platforms, a dynamic load balancing strategy based on a load-awareness algorithm was proposed and applied to the stream processing platform Apache Flink. Firstly, the computational delay of each node was obtained by depth-first search over the Directed Acyclic Graph (DAG) and taken as the basis for evaluating node performance, from which the load balancing strategy was built. Secondly, load migration for data streams was implemented on top of a data block management strategy, and both global and local load optimization were achieved through feedback. Finally, the feasibility of the algorithm was established by analyzing its time and space complexity, and the influence of important parameters on its execution was discussed. The experimental results show that the proposed algorithm improves task execution efficiency by optimizing load sharing among nodes, shortening task execution time by 6.51% on average compared with the traditional load balancing strategy of Apache Flink.
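The delay-via-DFS step admits a compact sketch: treat each operator's delay as its own latency plus the heaviest delay among its successors, computed by memoized depth-first search over the DAG. The graph shape and latency values are illustrative assumptions.

```python
def critical_delay(dag, latency):
    """For each node, its delay is its own processing latency plus the
    largest delay among its successors (the heaviest downstream path),
    computed by memoized DFS over the operator DAG."""
    memo = {}
    def dfs(node):
        if node not in memo:
            memo[node] = latency[node] + max(
                (dfs(s) for s in dag.get(node, [])), default=0)
        return memo[node]
    for node in latency:
        dfs(node)
    return memo
```

A node sitting upstream of a slow operator inherits that operator's delay, which is what lets the strategy rank nodes by how much they bottleneck the job.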
Parallel access strategy for big data objects based on RAMCloud
CHU Zheng, YU Jiong, LU Liang, YING Changtian, BIAN Chen, WANG Yuefei
Journal of Computer Applications    2016, 36 (6): 1526-1532.   DOI: 10.11772/j.issn.1001-9081.2016.06.1526
RAMCloud only supports small objects of at most 1 MB, so any object larger than 1 MB cannot be stored directly in a RAMCloud cluster. To resolve this storage limitation, a parallel access strategy for big data objects based on RAMCloud was proposed. Firstly, a big data object was divided on the client into several small objects of no more than 1 MB each, and a data summary was created there. The small objects were then stored in the RAMCloud cluster in parallel. When reading, the data summary was read first; the small objects were then read from the cluster in parallel according to the summary and merged back into the big data object. The experimental results show that, without changing the architecture of the RAMCloud cluster, the storage time of the proposed strategy reaches 16 to 18 μs and the reading time 6 to 7 μs. Under an InfiniBand network, the speedup of the proposed parallel strategy increases almost linearly, allowing big data objects to be accessed rapidly and efficiently at the microsecond level, just like small objects.
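The divide/summary/merge scheme can be sketched with an in-memory dict standing in for the RAMCloud cluster (the real RAMCloud client API is different; the key naming, summary layout, and thread pool here are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # RAMCloud's 1 MB object-size limit

class ChunkedStore:
    """Sketch of chunked parallel access: split, write a summary,
    store/fetch chunks in parallel, and merge on read."""
    def __init__(self, workers=8):
        self.kv = {}                      # stand-in for the cluster
        self.pool = ThreadPoolExecutor(workers)

    def put(self, key, data):
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        # The summary object records how to reassemble the big object.
        self.kv[key] = {"chunks": len(chunks)}
        # Store all chunks in parallel under per-chunk keys.
        list(self.pool.map(
            lambda i: self.kv.__setitem__(f"{key}/{i}", chunks[i]),
            range(len(chunks))))

    def get(self, key):
        summary = self.kv[key]            # read the summary first
        parts = self.pool.map(lambda i: self.kv[f"{key}/{i}"],
                              range(summary["chunks"]))
        return b"".join(parts)            # merge back into the big object
```

A 2 MB + 5 byte payload splits into three chunks and round-trips unchanged, which is the invariant the strategy must preserve regardless of parallelism.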
Link prediction algorithm based on node importance in complex networks
CHEN Jiaying, YU Jiong, YANG Xingyao, BIAN Chen
Journal of Computer Applications    2016, 36 (12): 3251-3255.   DOI: 10.11772/j.issn.1001-9081.2016.12.3251
Improving the accuracy of link prediction is one of the fundamental problems in complex network research, and existing node similarity-based prediction indexes do not make full use of the importance of nodes in the network. To solve this problem, a link prediction algorithm based on node importance was proposed. Degree centrality, closeness centrality and betweenness centrality were incorporated into local similarity indexes such as Common Neighbor (CN), Adamic-Adar (AA) and Resource Allocation (RA), yielding importance-aware CN, AA and RA indexes for computing node similarity. Simulation experiments were conducted on four real-world networks, with the Area Under the receiver operating characteristic Curve (AUC) adopted as the standard measure of prediction accuracy. The experimental results show that the prediction accuracy of the proposed algorithm on all four data sets is higher than that of the comparison algorithms such as CN, demonstrating that it outperforms traditional link prediction algorithms and yields more accurate predictions on complex networks.
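The importance-weighted indexes can be sketched directly from the standard CN/AA/RA definitions. This sketch uses only normalized degree centrality as the importance score (the paper also uses closeness and betweenness, and its exact combination of the three may differ):

```python
import math

def weighted_indices(adj, u, v):
    """CN/AA/RA similarity between u and v, with each common neighbor z
    weighted by an importance score (here: normalized degree centrality)."""
    common = adj[u] & adj[v]
    n = len(adj)
    # Degree centrality of each common neighbor, normalized by n - 1.
    importance = {z: len(adj[z]) / (n - 1) for z in common}
    # CN counts common neighbors; AA and RA discount high-degree ones.
    cn = sum(importance[z] for z in common)
    aa = sum(importance[z] / math.log(len(adj[z])) for z in common)
    ra = sum(importance[z] / len(adj[z]) for z in common)
    return cn, aa, ra
```

Note that a common neighbor of u and v always has degree at least 2, so the `log` in the AA term never hits zero.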